Gaussian copula
Probabilistic Missing Value Imputation for Mixed Categorical and Ordered Data
Social survey datasets, for example, are typically mixed because they include variables like age (continuous), demographic group (categorical), and Likert scales (ordinal) measuring how strongly a respondent agrees with certain stated opinions. Continuous variables are encoded as real numbers and sometimes called numeric. We refer to variables that admit a total order (e.g.
Copula Based Fusion of Clinical and Genomic Machine Learning Risk Scores for Breast Cancer Risk Stratification
Aich, Agnideep, Hewage, Sameera, Murshed, Md Monzur
Clinical and genomic models are both used to predict breast cancer outcomes, but they are often combined using simple linear rules that do not account for how their risk scores relate, especially at the extremes. Using the METABRIC breast cancer cohort, we studied whether directly modeling the joint relationship between clinical and genomic machine learning risk scores could improve risk stratification for 5-year cancer-specific mortality. We created a binary 5-year cancer-death outcome and defined two sets of predictors: a clinical set (demographic, tumor, and treatment variables) and a genomic set (gene-expression $z$-scores). We trained several supervised classifiers, such as Random Forest and XGBoost, and used 5-fold cross-validated predicted probabilities as unbiased risk scores. These scores were converted to pseudo-observations on $(0,1)^2$ to fit Gaussian, Clayton, and Gumbel copulas. Clinical models showed good discrimination (AUC 0.783), while genomic models had moderate performance (AUC 0.681). The joint distribution was best captured by a Gaussian copula (bootstrap $p=0.997$), which suggests a symmetric, moderately strong positive relationship. When we grouped patients based on this relationship, Kaplan-Meier curves showed clear differences: patients who were high-risk in both clinical and genomic scores had much poorer survival than those high-risk in only one set. These results show that copula-based fusion works in real-world cohorts and that considering dependencies between scores can better identify patient subgroups with the worst prognosis.
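The pseudo-observation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the common rank / (n + 1) convention for pseudo-observations and estimates the Gaussian copula correlation from normal scores, which is one of several possible fitting methods. The function names and the synthetic scores are hypothetical, and the Clayton and Gumbel fits are not shown.

```python
import math
import random
from statistics import NormalDist


def pseudo_observations(scores):
    """Map raw scores into (0, 1) as rank / (n + 1), averaging tied ranks."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with scores[order[i]].
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n + 1) for r in ranks]


def gaussian_copula_rho(u, v):
    """Pearson correlation of the normal scores Phi^{-1}(u) and Phi^{-1}(v)."""
    inv = NormalDist().inv_cdf
    zu = [inv(a) for a in u]
    zv = [inv(b) for b in v]
    mu, mv = sum(zu) / len(zu), sum(zv) / len(zv)
    cov = sum((a - mu) * (b - mv) for a, b in zip(zu, zv))
    var_u = sum((a - mu) ** 2 for a in zu)
    var_v = sum((b - mv) ** 2 for b in zv)
    return cov / math.sqrt(var_u * var_v)


# Synthetic stand-ins for clinical and genomic risk scores (NOT METABRIC data).
rng = random.Random(0)
clinical = [rng.random() for _ in range(500)]
genomic = [0.6 * c + 0.4 * rng.random() for c in clinical]

u = pseudo_observations(clinical)
v = pseudo_observations(genomic)
rho = gaussian_copula_rho(u, v)
print(f"estimated Gaussian copula correlation: {rho:.3f}")
```

Dividing by n + 1 rather than n keeps every pseudo-observation strictly inside (0, 1), so the inverse normal CDF is always finite.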
Neural Mutual Information Estimation with Vector Copulas
Chen, Yanzhi, Ou, Zijing, Weller, Adrian, Gutmann, Michael U.
Estimating mutual information (MI) is a fundamental task in data science and machine learning. Existing estimators mainly rely on either highly flexible models (e.g., neural networks), which require large amounts of data, or overly simplified models (e.g., Gaussian copula), which fail to capture complex distributions. Drawing upon recent vector copula theory, we propose a principled interpolation between these two extremes to achieve a better trade-off between complexity and capacity. Experiments on state-of-the-art synthetic benchmarks and real-world data with diverse modalities demonstrate the advantages of the proposed estimator.
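For context on the Gaussian copula baseline the abstract mentions, the standard closed form is given below; it is not the vector copula estimator the paper proposes. Mutual information is invariant under strictly monotone transformations of each coordinate, so under a Gaussian copula assumption it depends only on the copula correlation. The symbols $\rho$, $R$, $R_{XX}$, and $R_{YY}$ are notation introduced here, not taken from the paper. For scalar $X$ and $Y$ with copula correlation $\rho$:

$$
I(X;Y) = -\tfrac{1}{2}\log\left(1-\rho^{2}\right)
$$

For vector-valued $X$ and $Y$ whose normal scores have joint correlation matrix $R$ with diagonal blocks $R_{XX}$ and $R_{YY}$:

$$
I(X;Y) = -\tfrac{1}{2}\log\frac{\det R}{\det R_{XX}\,\det R_{YY}}
$$

Both expressions vanish when the cross-correlations are zero. This is why a Gaussian copula can miss dependence that is not visible in the correlation of the normal scores.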
Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
Pan, Wenbo, Xu, Jie, Chen, Qiguang, Dong, Junhao, Qin, Libo, Li, Xinfeng, Yu, Haining, Jia, Xiaohua
Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability. However, existing metrics fail to faithfully measure this ability. On the one hand, simple refusal-based metrics are biased by refusal rates and yield inconsistent scores when models exhibit different refusal tendencies. On the other hand, existing calibration metrics are proxy-based, capturing the performance of auxiliary calibration processes rather than the model's actual refusal behavior. In this work, we propose the Refusal Index (RI), a principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. To make RI practically measurable, we design a lightweight two-pass evaluation method that efficiently estimates RI from observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's intrinsic knowledge-aware refusal capability in factual tasks. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile. This finding highlights the need to complement traditional accuracy metrics with the Refusal Index for comprehensive factuality evaluation.
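The definition of RI as a Spearman rank correlation can be illustrated with a small sketch. It assumes per-question refusal and error probabilities are directly available, which is a simplification: the abstract says the paper estimates RI from refusal rates across two evaluation runs, and that two-pass procedure is not reproduced here. The function names and the probability values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values, assigning tied entries their mean rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of entries tied with values[order[i]].
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-question probabilities for five questions (illustrative only).
refusal_prob = [0.05, 0.10, 0.40, 0.70, 0.90]
error_prob = [0.20, 0.02, 0.35, 0.95, 0.60]

ri = spearman(refusal_prob, error_prob)
print(f"rank correlation between refusal and error probability: {ri:.2f}")
```

Because only ranks enter the computation, rescaling either probability monotonically leaves the value unchanged. This fits the abstract's claim that RI is stable across different overall refusal rates.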